Section 4 demonstrates the application of the Polymetrics framework to a simple supervised classification problem.
In addition to Polymetrics, the following libraries are imported:
```python
import pandas as pd
import numpy as np
import Polymetrics as poly
import FileImport
import polymetrics_config
import traceback
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import log_loss
```
The data used in the example is taken from patent US2013/0046061 (Hermel-Davidock et al.). The patent lists four inventive samples (IS) and four comparative samples (CS). The distinction between inventive and comparative samples is made using a novel descriptor, the Comonomer Distribution Constant (CDC). CDC is calculated solely from CEF plots and depends primarily on how the weight fraction data is distributed around its centroid: the lower the spread, the higher the CDC value. The inventive PE samples show CDC values greater than 45.
The spread in the data can also be determined independently by other statistical quantities such as standard deviation (STDEV), coefficient of variation (COV), median absolute deviation (MedianAD), interquartile range (IQR), etc. In this notebook, we shall see if these statistical descriptors can classify inventive/comparative samples as efficiently as the CDC descriptor.
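These spread descriptors can be computed directly with NumPy and SciPy. A minimal sketch on a synthetic signal (the array `signal` is an illustrative stand-in for a CEF weight-fraction trace, not patent data):

```python
import numpy as np
from scipy import stats

# Illustrative stand-in for a CEF weight-fraction signal (not patent data)
rng = np.random.default_rng(0)
signal = rng.normal(loc=1.0, scale=0.5, size=200)

mean = np.mean(signal)
stdev = np.std(signal, ddof=1)                   # standard deviation (STDEV)
cov = stdev / mean                               # coefficient of variation (COV)
median_ad = stats.median_abs_deviation(signal)   # median absolute deviation (MedianAD)
q75, q25 = np.percentile(signal, [75, 25])
iqr = q75 - q25                                  # interquartile range (IQR)

print(f"STDEV={stdev:.3f}, COV={cov:.3f}, MedianAD={median_ad:.3f}, IQR={iqr:.3f}")
```

All four quantities grow with the spread of the data, but they weight the tails differently: STDEV and COV are sensitive to outliers, while MedianAD and IQR are robust to them.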
The XLSXImport function prepares the imported polymer objects for further processing.
```python
df_in = FileImport.XLSXImport("Article/Example_Dataset.xlsx", sheet_name='Data')
```
The relevant data is selected from the patent dataset.
```python
# Filtering relevant data for the analysis
df_pat = df_in[(df_in['Project'] == 'US20130046061') & (df_in['Type'] == 'Resin_Developmental')]
```
Explanatory variables/features are developed for every polymer object by looping through the polymer objects sequentially. The additional features developed from the experimental data can be combined with the rest of the variables to make a features dataset.
The loop sequentially applies the BasicStats function on each polymer object. The features returned by the function are stored in FeaturesEng_df.
In the patent dataset, there are four inventive samples (IS) and four comparative samples (CS). The CEF data is provided for all the resins except CS4. In the absence of a CEF plot, the BasicStats function cannot generate descriptors for sample CS4, leaving only seven samples for which features can be calculated. The try/except block in Python catches the error and continues execution of the code.
```python
FeaturesEng_df = pd.DataFrame(columns=df_pat.index)
fig = go.Figure()  # Initiating an empty plotly figure

for i, polymer_ in zip(df_pat.index, df_pat.to_dict(orient="records")):
    PE = poly.Polymer(polymer_)
    try:
        original_data, interpolated_data = PE.CEF_interpolate(minT=30.0,
                                                              maxT=110.0,
                                                              BaselineCorrection=True,
                                                              Smoothening=False, Plot=False)
        # Adding traces sequentially to 'fig'
        fig.add_trace(go.Scatter(x=interpolated_data['T_interpolated'],
                                 y=interpolated_data['Signal_interpolated'],
                                 name=polymer_['Identifier'],
                                 line_shape='linear'))
        # Feature building
        BasicStats_dict = PE.BasicStats(Interpolate=True, minT=30.0, maxT=110.0)
        FeaturesEng_dict = {**BasicStats_dict}
        FeaturesEng_df.loc[:, i] = pd.Series(FeaturesEng_dict)
    except Exception as error:
        print('Polymetrics Error at', polymer_['Identifier'], repr(error))

FeaturesEng_df = FeaturesEng_df.T
FeaturesEng_df = FeaturesEng_df.astype('float64')
display(pd.concat([df_pat['Identifier'], FeaturesEng_df], axis=1))
```
Polymetrics Error at CS4 AttributeError("'Polymer' object has no attribute 'df_CEF'")
| | Identifier | Mean | STDEV | COV | Median | IQR | MAD | MedianAD | AUC |
|---|---|---|---|---|---|---|---|---|---|
| 0 | IS1 | 1.548435 | 2.337687 | 1.509710 | 0.195002 | 2.290279 | 1.918590 | 0.200268 | 124.123015 |
| 1 | IS2 | 1.166950 | 3.605523 | 3.089697 | 0.010088 | 0.313656 | 1.859791 | 0.031450 | 93.544473 |
| 2 | IS3 | 1.450821 | 2.758006 | 1.900997 | 0.197935 | 1.426555 | 1.806271 | 0.197935 | 116.298258 |
| 3 | IS4 | 1.217986 | 3.121087 | 2.562498 | 0.009226 | 0.462643 | 1.836478 | 0.012608 | 97.633961 |
| 4 | CS1 | 1.180044 | 1.358465 | 1.151199 | 0.416675 | 2.292036 | 1.206823 | 0.418894 | 94.592721 |
| 5 | CS2 | 1.153728 | 1.191008 | 1.032312 | 1.144547 | 1.581618 | 0.926675 | 0.927813 | 92.483233 |
| 6 | CS3 | 1.115792 | 1.099710 | 0.985587 | 0.811004 | 1.535052 | 0.872902 | 0.758815 | 89.442185 |
| 7 | CS4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The CEF plots are reproduced below for reference.
```python
fig.show()
```
The FeaturesEng_df is then appended to the rest of the numeric data in the df_pat DataFrame. We apply the drop_correlated function to the features DataFrame: it retains the features of interest and, from the rest of the dataset, keeps only the features that are not correlated with the retained ones.
```python
df_pat_numeric = df_pat.select_dtypes(np.number)  # numeric data is separated from df_pat
Data = pd.concat([df_pat_numeric, FeaturesEng_df, df_pat['Classification']], axis=1)  # the newly developed features are appended to the rest of the features
Data.dropna(inplace=True, how='any', axis=0)  # removes CS4
Features = Data.drop('Classification', axis=1)
Features = poly.drop_correlated(Features, coeff=0.75,
                                Retain=['CDC', 'Unsat_1M_C', 'COV', 'STDEV', 'IQR', 'MedianAD'],
                                Drop=[], Plot=False)
display(Features)
```
Correlated variables in remaining data to drop ['Tc', 'delHc', 'Tm', 'delHm', 'Mn', 'Mw', 'Mz', 'ZSVR', 'I2', 'I10', 'Mean', 'MAD', 'AUC'] Variables correlated with the variables to retain ['Median']
| | CDC | Unsat_1M_C | COV | STDEV | IQR | MedianAD | Density |
|---|---|---|---|---|---|---|---|
| 0 | 63.8 | 62.0 | 1.509710 | 2.337687 | 2.290279 | 0.200268 | 0.912 |
| 1 | 140.9 | 71.0 | 3.089697 | 3.605523 | 0.313656 | 0.031450 | 0.937 |
| 2 | 152.4 | 78.0 | 1.900997 | 2.758006 | 1.426555 | 0.197935 | 0.912 |
| 3 | 98.9 | 73.0 | 2.562498 | 3.121087 | 0.462643 | 0.012608 | 0.916 |
| 4 | 27.5 | 145.0 | 1.151199 | 1.358465 | 2.292036 | 0.418894 | 0.916 |
| 5 | 25.9 | 314.0 | 1.032312 | 1.191008 | 1.581618 | 0.927813 | 0.916 |
| 6 | 16.1 | 252.0 | 0.985587 | 1.099710 | 1.535052 | 0.758815 | 0.914 |
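The exact behaviour of `poly.drop_correlated` is defined by Polymetrics; a rough pandas-only approximation of the idea (`drop_correlated_sketch` is a hypothetical helper written for illustration, not the library's actual code) might look like:

```python
import pandas as pd

def drop_correlated_sketch(df, coeff=0.75, retain=()):
    """Keep the retained columns; drop any other column whose absolute
    correlation with a kept column exceeds `coeff`.
    Illustrative approximation of poly.drop_correlated, not its actual code."""
    corr = df.corr().abs()
    keep = [c for c in retain if c in df.columns]
    for col in df.columns:
        if col in keep:
            continue
        # keep the column only if it is weakly correlated with all kept ones
        if all(corr.loc[col, k] <= coeff for k in keep):
            keep.append(col)
    return df[keep]

# Toy usage: 'b' is perfectly correlated with 'a' and gets dropped
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
print(drop_correlated_sketch(toy, coeff=0.9, retain=['a']).columns.tolist())  # ['a', 'c']
```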
We look at rank correlations between the different features. STDEV and COV are well correlated with CDC. Since the correlation is not perfect, however, STDEV and COV do not rank the samples in exactly the same order as CDC.
```python
fig1 = px.imshow(Features.corr(method='spearman'),
                 color_continuous_scale=px.colors.diverging.BrBG,
                 color_continuous_midpoint=0,
                 width=600,
                 height=600,
                 title='Rank Correlation Matrix with Selected Features')
fig1.show()
```
Next, we examine how individual features would classify the samples. The predictive ability of each feature is checked by fitting single-feature classification models. The predictions and log loss values are compared below.
```python
clf = LogisticRegressionCV(cv=2)  # estimator: logistic regression with 2-fold cross-validation
YY = Data['Classification']
Prediction_dict = {'True Labels': Data['Classification'].to_numpy()}
logloss_dict = {'True Labels': [0.0]}

for col in Features.columns:
    XX = Features[[col]]
    clf.fit(XX, YY)
    Prediction_dict[col] = clf.predict(XX)
    logloss_dict[col] = log_loss(YY, clf.predict_proba(XX))

LL = pd.DataFrame.from_dict(Prediction_dict)
MM = pd.DataFrame.from_dict(logloss_dict)
print('-------Predictions-------')
display(LL)
print('-------Log Loss-------')
display(MM)
```
-------Predictions-------
| | True Labels | CDC | Unsat_1M_C | COV | STDEV | IQR | MedianAD | Density |
|---|---|---|---|---|---|---|---|---|
| 0 | Inventive | Inventive | Inventive | Inventive | Inventive | Comparative | Inventive | Inventive |
| 1 | Inventive | Inventive | Inventive | Inventive | Inventive | Inventive | Inventive | Inventive |
| 2 | Inventive | Inventive | Inventive | Inventive | Inventive | Inventive | Inventive | Inventive |
| 3 | Inventive | Inventive | Inventive | Inventive | Inventive | Inventive | Inventive | Inventive |
| 4 | Comparative | Comparative | Inventive | Comparative | Comparative | Comparative | Comparative | Inventive |
| 5 | Comparative | Comparative | Comparative | Comparative | Comparative | Inventive | Comparative | Inventive |
| 6 | Comparative | Comparative | Comparative | Comparative | Comparative | Inventive | Comparative | Inventive |
-------Log Loss-------
| | True Labels | CDC | Unsat_1M_C | COV | STDEV | IQR | MedianAD | Density |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.247854 | 0.339189 | 0.324732 | 0.175448 | 0.56573 | 0.231993 | 0.682908 |
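The log loss reported here is the binary cross-entropy averaged over the samples. A quick sanity check against sklearn's `log_loss` (with hand-picked toy probabilities, not the model's actual outputs):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 1, 0, 0])          # 1 = Inventive, 0 = Comparative
p_pred = np.array([0.9, 0.8, 0.2, 0.3])  # predicted P(Inventive)

# binary cross-entropy: -mean(y*log(p) + (1-y)*log(1-p))
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(manual, log_loss(y_true, p_pred))  # the two values agree
```

Lower values indicate more confident, correct predictions, which is why the perfectly classifying features also show the lowest log loss above.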
It can be seen that STDEV and MedianAD are as effective as CDC in creating the decision boundary. Since their rank correlations with CDC are not exactly one, however, these statistical variables may not capture the same characteristics as CDC.